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Abstract 



Prediction error is critical to assessing the performance of statistical methods and selecting 
statistical models. We propose the cross-validation and approximated cross-validation methods for 
estimating prediction error under a broad q-class of Bregman divergence for error measures which 
embeds nearly all of the commonly used loss functions in regression, classification procedures 
and machine learning literature. The approximated cross-validation formulas are analytically 
derived, which facilitate fast estimation of prediction error under the Bregman divergence. We 
then study a data-driven optimal bandwidth selector for the local-likelihood estimation that 
minimizes the overall prediction error or equivalently the covariance penalty. It is shown that 
the covariance penalty and cross-validation methods converge to the same mean-prediction-error- 
criterion. We also propose a lower-bound scheme for computing the local logistic regression 
estimates and demonstrate that it is as simple and stable as the local least-squares regression 
estimation. The algorithm monotonically enhances the target local-likelihood and converges. The 
idea and methods are extended to the generalized varying-coefficient models and semiparametric 
models. 

Key words and Phrases: Cross-validation; Exponential family; Generahzed varying-coefficient 

model; Local likelihood; Loss function; Prediction error. 

Short title: Assessing Prediction Error under Bregman Divergence 
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1 Introduction 



Assessing prediction error lies at the heart of statistical model selection and forecasting. Depending 
on the needs of statistical learning and model selection, the quadratic loss is not always appropriate. 
In binary classification, for example, the misclassification error rate is more suitable. The corre- 
sponding loss, however, does not differentiate the predictive powers of two procedures which forecast 
the class label being 1 with probabilities 60% and 90% respectively. When the true class label is 1, 
the second procedure is more accurate, whereas when the true class label is 0, the first procedure is 
more accurate. The quantification of the predicative accuracy requires an appropriate introduction 
of loss functions. An example of this is the negative Bernoulli log-likelihood loss function. Other 
important margin-based loss functions have been introduced for binary classification in the machine 
learning literature (Hastie, Tibshirani and Friedman 2001). Hence, it is important to assess the 
prediction error under a broader class of loss functions. 

A broad and important class of loss functions is the Bregman g-class divergence. It accounts for 
different types of output variables and includes the quadratic loss, the deviance loss for exponential 
family of distributions, misclassification loss and other popular loss functions in machine learning. 
See Sectional Once a prediction error criterion is chosen, the estimates of prediction error are needed. 
Desirable features include computational expediency and theoretical consistency. In the traditional 
nonparametric regression models, the residual-based cross-validation (CV) is a useful data-driven 
method for the automatic smoothing (Wong 1983; Rice 1984; Hardle, Hall, and Marron 1992; Hall 
and Johnstone 1992) and can be handily computed. With the arrival of the optimism theorem (Efron 
2004), estimating the prediction error becomes estimating covariance-penalty terms. Following Efron 
(2004), the covariance-penalty can be estimated using model-based bootstrap procedures. A viable 
model-free method is the cross- validated estimation of the covariance-penalty. Both methods can be 
shown to be asymptotically equivalent to the first order approximation. However, both methods are 
extremely computationally intensive in the context of local-likelihood estimation, particularly for the 
large sample sizes. The challenge then arises from efficient computation of the estimated prediction 
error based on the cross-validation. 

The computational problem is resolved via the newly developed approximate formulas for the 
cross-validated covariance-penalty estimates. A key component is to establish the "leave-one-out 
formulas" which offer an analytic connection between the leave-one-out estimates and their "keep- 
all-in" counterparts. This technical work integrates the infinitesimal perturbation idea (Pregibon 
1981) with the Sherman- Morrison- Woodbury formula (Golub and Van Loan 1996, p. 50). It is 
a natural extension of the cross-validation formula for least-squares regression estimates, and is 
applicable to both parametric and nonparametric models. 

The applications of estimated prediction error pervade almost every facet of statistical model 
selection and forecasting. To be more specific, we focus on the local-likelihood estimation in varying 
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coefficient models for response variables having distributions in the exponential family. Typical 
examples include fitting the Bernoulli distributed binary responses, and the Poisson distributed 
count responses, among many other non-normal outcomes. As a flexible nonparametric model- 
fitting technique, the local-likelihood method possesses nice sampling properties. For details, see, for 
example, Tibshirani and Hastie (1987), Staniswalis (1989), Severini and Staniswalis (1994), and Fan, 
Heckman, and Wand (1995). An important issue in application is the choice of smoothing parameter. 
Currently, most of the existing methods deal with the Gaussian type of responses; clearly there is a 
lack of methodology for non-Gaussian responses. The approximate cross-validation provides a simple 
and fast method for this purpose. The versatility of the choice of smoothing parameters is enhanced 
by an appropriate choice of the divergence measure in the g-class of loss functions. 

The computational cost of the approximate CV method is further reduced via a newly introduced 
empirical version of CV, called ECV, which is based on an empirical construction of the "degrees 
of freedom", a notion that provides useful insights into the local-likelihood modeling complexity. 
We propose a data-driven bandwidth selection method, based on minimizing ECV, which will be 
shown to be asymptotically optimal in minimizing a broad g-class of prediction error. Compared 
with the two-stage bandwidth selector of Fan, Farmen, and Gijbels (1998), our proposed method has 
a broader domain of applications and can be more easily understood and implemented. 

Some specific attentions are needed for the local logistic regression with binary responses, whose 
distribution belongs to an important member of the exponential family. To address the numerical 
instability, we propose to replace the Hessian matrix by its global lower-bound (LB) matrix, which 
does not involve estimating parameter vectors and therefore can easily be inverted before the start of 
the Newton- Raphson (NR) iteration. A similar idea of LB was used in Bohning and Lindsay (1988) 
for some parametric fitting. We make a conscientious effort to further develop this idea for the local 
logistic estimation. The resulting LB method gains a number of advantages: The LB algorithm, at 
each iteration, updates the gradient vector but does not recalculate the Hessian matrix, thus is as 
simple and stable as the local least-squares regression estimation. The LB method ensures that each 
iterative estimate monotonically increases the target local-likelihood. In contrast, this property is 
not shared by the standard NR method. Hence, the LB iteration is guaranteed to converge to the 
true local MLE, whereas the NR is not necessarily convergent. Moreover, we develop a new and 
adaptive data-driven method for bandwidth selection which can effectively guard against under- or 
over-smoothing. 

The paper is organized as follows. Section |21 addresses the issue of estimating prediction error. 
Sections|21develops computationally feasible versions of the cross- validated estimates of the prediction 
error. Section 0] proposes a new bandwidth selection method for binary responses, based on the LB 
method and the cross- validated estimates of the prediction error. Sections [SHHl extend our results to 
generalized varying-coefficient models and semiparametric models respectively. Section [7] presents 
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simulation evaluations and Section|Hlanalyzes real data. Technical conditions and proofs are relegated 
to the Appendix. 

2 Estimating Prediction Error 

To begin with, we consider that the response variable Y given the vector X of input variables has a 
distribution in the exponential family, taking the form, 

/yix(y; e{x)) = exp[{y^(x) - b{9{x))}/a{i;) + c{y, V)], (2.1) 

for some known functions a(-), and c(-, •), where 9{x) is called a canonical parameter and tp is 
called a dispersion parameter, respectively. It is well known that 

m(x) = E{Y\X = x) = b'{e{x)), and ^^(x) = var(y|X = x) = a{iP)b" {9{x)). 

See Nelder and Wedderburn (1972) and McCullagh and Nelder (1989). The canonical link is g{-) = 
{b')~^{-), resulting in g{m(x)) = 0{x). For simplicity of notation and exposition, we will focus only 
on estimating the canonical parameter ^(x). The results can easily be generalized to other link 
functions. 

2.1 Bregman Divergence 

The prediction error depends on the divergence measure. For non-Gaussian responses, the quadratic 
loss function is not always adequate. For binary classification, a reasonable choice of divergence 
measure is the misclassification loss, Q{Y,iri) = 1{Y ^ I(m > .5)}, where !(•) is an indicator 
function and fh is an estimator. However, this measure does not differentiate the predictions m = .6 
and fh = .9 when 1" = 1 or 0. In the case that y = 1, m = .9 gives a better prediction than 
fh = .6. The negative Bernoulli log-likelihood, Q{Y,fh) = — yin(m) — (1 — y)ln(l — fh), captures 
this. Other loss functions possessing similar properties include the hinge loss function, Q(Y,fh) = 
max{l — {2Y — l)sign(m — .5), 0}, in the support vector machine and the exponential loss function, 
Q(Y,fh) = exp{ — (y — .5)ln(m/(l — fh))}, popularly used in AdaBoost. These four loss functions, 
shown in Figure ^ belong to the margin-based loss functions written in the form, V(Y*F), for 
y* = 2y — 1 and some function F. 

[ Put Figure^about here ] 

To address the versatility of loss functions, we appeal to a device introduced by Bregman (1967). 
For a concave function g(-), define a g-class of error measures Q as 

Q(y, fh) = q{fh) + q'{fh){Y - fh) - q{Y). (2.2) 
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A graphical illustration of Q associated with q is displayed in Figure |2 Due to the concavity of 
Q is non-negative. However, since Q(-,-) is not generally symmetric in its arguments, Q is not a 
"metric" or "distance" in the strict sense. Hence, we call Q the Bregman "divergence" (BD). 

[ Put Figure^ahout here ] 

It is easy to see that, with the flexible choice of q, the BD is suitable for a broad class of error 
measures. Below we present some notable examples of the Q-loss constructed from the g-function. 

• A function qi{m) = am — m? for some constant a yields the quadratic loss Qi{Y, m) = (Y — m)'^. 

• For the exponential family (|2.H) . the function q2{m) = 2{b{9) — mO} with b'{6) = m results in 
the deviance loss, 

Q2{Y,m) = 2{Y{e- 9) - b{9) + b{e)}, (2.3) 
where b'{9) = Y and b'(9) = fh. 

• For a binary response variable Y, the function q{m) = min{m-, (1 — m)} gives the misclassi- 
fication loss; the function q{m) = .5min{m, (1 — m)} results in the hinge loss; the function 
q-i{m) = 2{m(l — m)}^/^ yields the exponential loss, 

Q3(y,m) = exp{-(y-.5)ln(m/(l-a))}. (2.4) 
2.2 Prediction Error under Bregman Divergence 

Let rrii = rniXi) and fhi be its estimate based on independent observations {(Xj, yi)}"^;^. Set 

eiii = Q{Yi,irii) and Erii = Eo{Q{Y°,irii)}, 

where Yf is an independent copy of Yi and is independent of (li,...,l^), and Eg refers to the 
expectation with respect to the probability law of Y°. Note that the conditional prediction error, 
defined by Err^, is not observable, whereas the apparent error, err^, is observable. As noted in 
Tibshirani (1996), directly estimating the conditional prediction error is very difficult. Alternatively, 
estimating Errj is equivalent to estimating the difference Oi = Errj — errj, called optimism. 

Efron (2004) derives the optimism theorem to represent the expected optimism as the covariance 
penalty, namely, E{Oi) = 2cov(Aj,li), where Aj = —q' {fhi) /2. As a result, the predictive error can 
be estimated by 

Err j = err j + 2 cov j , (2-5) 

where covj is an estimator of the covariance penalty, cov(Ai, Yi). This is an insightful generalization 
of AIC. Henceforth, the total prediction error Err = 'Ym=i ^^"^ ^e estimated by Err = 'Ym=i Erfi- 
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2.3 Estimation of Covariance Penalty 

In nonparametric estimation, we write rhi^h and Xi^h to stress their dependence on a smoothing 
parameter h. According to H2.5|l . the total prediction error is given by 

n n 

= J^Q(y„m,,;,) + J];25ot(V,^.)- (2-6) 

1=1 i=l 

Estimating the covariance-penalty terms in (|2.6|) is not trivial. Three approaches are potentially 
applicable to the estimation of the covariance penalty: model-based bootstrap developed in Efron 
(2004), the data-perturbation (DP) method proposed by Shen, Huang and Ye (2004), and model-free 
cross-validation. All three are computationally intensive in the context of local-likelihood estimation 
(the DP method also needs to select a perturbation size). In contrast, the third method allows us 
to develop approximate formulas to significantly gain computational expedience. For this reason, we 
focus on the cross-validation method. 

The cross- validated estimation of Err^ is Q{Yi,fh^JJ, where the superscript —i indicates the 
deletion of the ith data point (Xj,l^) in the fitting process. This yields the cross-validated estimate 
of the total prediction error by 

n 

^T''''{h)=Y,QiYi,m7l). (2.7) 

i=l 

Naive computation of {frij^l} is intensive. Section |31 will devise strategies by which actual compu- 
tations of the leave-one-out estimates are not needed. A distinguished feature is that our method 
is widely applicable to virtually all regression and classification problems. The approximated CV is 
particularly attractive to a wide array of large and complex problems in which a quick and crude 
selection of the model parameter is needed. 

By comparing ()2.7|) with ()2.6|) . the covariance penalty in (|2.7p is estimated by 

n 

^{Q{Yi,m7l) - Q{Yi,mi,h)}. 

i=l 

This can be linked with the jackknife method for estimating the covariance penalty. Hence, it is 
expected that the cross-validation method is asymptotically equivalent to a bootstrap method. 

2.4 Asymptotic Prediction Error 

To gain insight on Err(/i), we appeal to the asymptotic theory. Simple algebra shows that Errj(/i) = 
Qimi^rhi^h) + E{Q{Yi,'mi)}. By Taylor's expansion and 1)2. 2() . 
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Hence, 

Err,(/i) = -{mi^h-m,f(^'{fhi,h)/2 + E{Q{Yi,m,)}. 

Note that the last term does not depend on h and hence Err(/i) is asymptotically equivalent to the 
mean-prediction-error-criterion, 

MPEC(/i) = -2-^ [ E[{mhix) - m{x)}^\X]q" {m{x))fxix)dx, (2.8) 



with X = (Xi, . . . ,X„) and fx{x) being the probability density of X. This criterion differs from the 
mean-integrated-squared-error criterion defined by 



MISE(/i) = j E[{mh{x) - m{x)y\X]{b"{e{x))r'fx{x)dx, (2.9) 

recalling that 

9{x) - e{x) = {b"{9{x))}-^{mh{x) - m{x)}. 

Expression (|2.8)) reveals that asymptotically, different loss functions automatically introduce dif- 
ferent weighting schemes in (|2.8|) . This provides a useful insight into various error measures used 
in practice. The weighting schemes vary substantially over the choices of q. In particular, for the 
(7i-function yielding the quadratic-loss in Section [2.11 we have 

MPECi(/i) = J E[{mh{x) - m{x)}^\X]fxix)dx. 

For the g'2-function producing the deviance-loss, we have 

MPEC2(/i) = j E[{mh{x) - m{x)}^\X]{b"{9{x))}-^fx{x)dx. 

For the ^a-function inducing the exponential-loss for the binary responses, we have 

MPECs{h)= [ E[{m,{x) - m{x)}'\X] dx. 

3 Approximate Cross- Validation 

This section aims at deriving the approximate and empirical versions of H2.7() for the local maximum 
likelihood estimator. We focus on the univariate problem in this section. The results will be extended 
to the generalized varying coefficient models in Section [S] incorporating multivariate covariates. 

Assume that the function 9{-) has a (p + l)-th continuous derivative at a point x. For Xj close 
to X, the Taylor expansion implies that 

e{Xj)=^,{xf(3ix), 
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in which Xj(x) = (1, {Xj — x), . . . , {Xj — xY)'^ and (3{x) = (/?o(x), . . . , (3p{x))'^ . Based on the inde- 
pendent observations, the local parameters can be estimated by maximizing the local log-likelihood, 

n 

e{/3;x) ^^l{^,ixf/3;Yj)KhiXj - x), (3.1) 
i=i 

in which l{-;y) = ln{/y|x(y; •)} denotes the conditional log-likelihood function, Kh{-) = K{-/h)/h 
for a kernel function K and /i is a bandwidth. Let f3{x) = {Pq{x), . . . ,Pp{x))'^ be the local maximum 
likelihood estimator. Then, the local MLEs of 6{x) and m{x) are given by 9{x) = (3o{x) and 
fh{x) = b'{6{x)), respectively. A similar estimation procedure, based on the n — 1 observations 
excluding {Xi,Yi), leads to the local log-likelihood function £~*(/3;x), and the corresponding local 
MLEs, P (x), 9~'^{x), and m^*(x), respectively. 

3.1 Weighted Local-Likelihood 

To compute approximately /3 (x) from /3(x), we apply the "infinitesimal perturbation" idea devel- 
oped in Pregibon (1981). We introduce the weighted local log-likelihood function, 

n 

i^4/3; x) = Y, ^ij K^xfP; Y,)Kh{X, - x), (3.2) 

with the weight da = 6 and the rest weights 6ij = 1. Let (3^ ^{x) be the maximizer. Note that 
when 5=1, this estimator is the local maximum likelihood estimator and when (5 = 0, it is the 
leave-one-out estimator. 

The weighted local MLE is usually found via the Newton-Raphson iteration, 

/3L = /9L-i-{v'^^,5(/3L_i;x)}-iv45(/3L-i;a;), L = l,2,..., (3.3) 

where S/i denotes the gradient vector and V^^ the Hessian matrix. (Explicit expressions of S/i and 
V^^ are given in Lemma [5J) When the initial estimator /3q is good enough, the one-step {L = 1) 
estimator is as efficient as the fully iterated estimator (Fan and Chen 1999). 

The key ingredient for calculating the leave-one-out estimator is to approximate it by its one-step 
estimator using the "keep-all-in" estimator /3(x) as the initial value. Namely, (3 (x) is approximated 
by 1)3. 3() with (3q = (3{x) and 5 = 0. With this idea and other explicit formulas and approximation, 
we can derive the approximate leave-one-out formulas. 

3.2 Leave-One-Out Formulas 

Let X(3;) = (xi(x), . . .,x„(x))'^, 

W(x;/3) = diag{i^;,(X, - x)6"(xj(x)^/3)}, (3.4) 
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and Sn{x;P) = X{x fW{x;p)X{x). Define 

n{x; (3) = {W{x; P)}'/^X{x){Sn{x; /3)}-^X(x)^{W(:e; (3.5) 

This projection matrix is an extension of tlie liat matrix in the multiple regression and will be 
useful for computing the leave-one-out estimator. Let TCa^x; (3) be its z-th diagonal element and 
Hi = 7iii{Xi; P(Xi)). Then, our main results can be summarized as follows. 

Proposition 1 Assume condition {A2) in the Appendix. Then for i = 1, . . . ,n, 

{Sn{x;P{x))}-^^i{x)Kh{Xi - x){Yi - b'iMxfdix))} 



l-niiix;Pix)) 



Hi 



(5 {x)-fi{x) = ^gg^ 

(3.7) 

l-H-^ ' ^'-'^ 
Note that the approximation becomes exact when the loss function is the quadratic loss. In fact, 
Zhang (2003) shows an explicit delete-one-out formula. 



1 - H, 
Hi 



-{Y,-m)/b"{e,), 

-{Yi - fhi). 



Hi 



m- = m,- 



-{Yi - fhi). 



l-H, 

In addition, ()3. 6^ - 1)3. 8() hold for h — > oo, namely, the parametric maximum likelihood estimator. 
Even the results for this specific case appear new. Furthermore, the results can easily be extended 
to the estimator that minimizes the local Bregman divergence, replacing /(xj(x)"^/3; 1^) in (|3.H) by 
Q{Y,,g-\^,{xY(3)). 

Using Proposition^ we can derive a simplified formula for computing the cross- validated estimate 
of the overall prediction error. 



Proposition 2 Assume conditions {A\) and {A2) in the Appendix. Then 

n 

= [<3(^*'"i*) + 2-\"{m^){Y^ - mif{l - 1/(1 - Hif] 



i=l 



(3.9) 



Proposition |51 gives an approximation formula, which avoids computing "leaving-one-out" esti- 
mates, for all g-class of loss functions. In particular, for the function qi, we have 

n n 

YiY, - = Y.^Y, - m,)V(l " H.f ■ 

i=l 1=1 

For this particular loss function, the approximation is actually exact. For the function q2 leading to 
the deviance loss Q2 defined in (|2.3() . we have 

{Y^ - rhif 



1=1 



1=1 



Q2{Yi,mi) 



b"{ei) 



-{I Hi)'] 



(3.10) 



For the exponential loss defined in ()2.4() for binary classification, we observe 

{Yi - mi)'' 



J2Qs{Y,,mr) = Y^ 



i=l 



i=l 



Q3{Yi,mi 



4{mj(l - mj)}3/2 



{1-1/{1-H,)'} 
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3.3 Two Theoretical Issues 



Two theoretical issues are particularly interesting. The first one concerns the asymptotic convergence 
of /lACV; the minimizer of the right hand side of (|3.9|) . Following a suitable modification to the result 
of Altman and MacGibbon (1998), the ratio /iacv/^ampec converges in probability to 1, where 
^AMPEC is the minimizer of the asymptotic form of MPEC(/i) defined in (|2.8|) . 

The explicit expression of /iampeC) associated with the g-class of error measures, can be obtained 
by the delta method. Setting —2^^q"{m{x)){b"{9{x))}'^ to be the weight function, /iampec (for odd 
degrees p of local polynomial fitting) can be derived from Fan, et al. (1995, p. 147): 

a(V') / b"{e{x))q"{m{x))dx ^ 



^AMPEc(g) = Cp{K) 



n-i/(2p+3), (3.11) 



J{0(P+1) {x)y{b"{9{x))}^q"{m{x))fx {x)dx_ 
where Cp{K) is a constant depending only on the degree and kernel of the local regression. In 
particular, for the g2-function which gives the deviance-loss, we have 



/lAMPEc(g2) = Cp{K) 



l/(2p+3) 

n-i/(2p+3), (3.12) 



f{eiP+^){x)}H"{e{x))fx{x)dx_ 

where \i^x\ measures the length of the support of fx- Apparently, this asymptotically optimal 
bandwidth differs from the asymptotically optimal bandwidth, 

■ a{ijj) f{b"{e{x))}-^dx T i/(2p+3) 



/iAMISE = Cp{K) 



j{e(p+^){x)}^fx{x)dx 



n-^/(2f+3), (3.13) 



determined by minimizing the asymptotic MISE(/i) of 9 defined in (|2.9)1 . with an exception of the 
Gaussian family. 

[ Put TabieQ] about here ] 

The second issue concerns how far away /iampec (92) departs from /iamise- For Poisson and 
Bernoulli response variables, examples in Table ^ illustrate that the distinction between /iampec((?2) 
and /lAMiSE can be noticeable. To gain further insights, we will need the following definition. 

Definition 1 Two functions F and G are called similarly ordered if {F{xi) — F{x2)}{G{xi) — 
G{x2)} > for all xi in the domain of F and all X2 in the domain of G, and oppositely ordered if 
the inequality is reversed. 

The following theorem characterizes the relation between /iampec ('72 ) and /iamise- 

Proposition 3 Define F{x) = {e^P+''\x)}'^b" {e{x))fx{x) and G{x) = {b"{6{x))Y^. A ssume that p 
is odd. 

(a) // F and G are oppositely ordered, then /iampec (92) ^ /lAMlSE- If F and G are similarly 
ordered, then /iampec (92) > /iamise- 
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(6) Assume that b"{9{x)) is bounded away from zero and infinity. Write mj,// = minx^Ux ^"(^(^)) 
and My I = maXj^gQ^^ h"{6{x)). If 9{x) is a polynomial function of degree p + 1, and fx is a 
uniform density on 0,x, then 

\{mh»+Mbnyj ~ /lAMISE ~ 

in which the equalities are satisfied if and only if the exponential family is Gaussian. 

3.4 Empirical Cross- Validation 

The approximate CV criterion H3.9|) can be further simphfied. To this end, we first approximate 
the "degrees of freedom" ^l^i Hi (Hastie and Tibshirani 1990). To facihtate presentations, we now 
define the "equivalent kernel" /C(t) induced by the local-polynomial fitting as the first element of the 
vector S~^{l,t, . . . ,tP)'^K{t), in which the matrix S = (^i+j_2)i<ij<p+i with Hk = J t^K{t) dt. See 
Ruppert and Wand (1994). 

Proposition 4 Assume conditions (A) and (B) in the Appendix. If n ^ oo, /i — > 0, and nh — > oo, 

we have 

n 

Y,H, = ICmnx\/h{l + op{l)}, 
1=1 

where \^x\ denotes the length of the support of the random variable X. 

Proposition |3 shows that the degrees of freedom is asymptotically independent of the design 
density and the conditional density. It approximates the notion of model complexity in nonparametric 
fitting. 

[ Put Tahle^ahout here ] 

Proposition ^ does not specify the constant term. To use the asymptotic formula for finite 
samples, we need some bias corrections. Note that when h ^ oo, the local polynomial fitting 
becomes a global polynomial fitting. Hence, its degrees of freedom should be p + 1. This leads us to 
propose the following empirical formula: 

n 

J2Hi = {p+l-3.)+ Cn/{n - l)IC{0)\nx\/h. (3.14) 
1=1 

In the Gaussian family, Zhang (2003) used simulations to determine the choices a and C. See Table 
1^1 which uses the Epanechnikov kernel function, K{t) = .75(1 — t'^)+. Interestingly, our simulation 
studies in Section [7| demonstrate that these choices also work well for Poisson responses. However, 
for Bernoulli responses, we find that for p = 1, slightly different choices given by a = .7 and C = 1.09 
provide better approximations. 



11 



We propose the empirical version of the estimated total prediction error by replacing Hi in ()3.9() 
with their empirical average, He = (p + 1 — a)/n + C/(n — l)IC{0)\Qx\/h, leading to the empirical 
cross-validation (ECV) criterion, 



This avoids calculating the smoother matrix 7i. Yet, it turns out to work reasonably well in practice. 
A data-driven optimal bandwidth selector, /iecV) can be obtained by minimizing (|3.15|) . 

4 Nonparametric Logistic Regression 

Nonparametric logistic regression plays a prominent role in classification and regression analysis. 
Yet, distinctive challenges arise from the local MLE and bandwidth selection. When the responses 
in a local neighborhood are entirely zeros or entirely ones (or nearly so), the local MLE does not 
exist. Miiller and Schmitt (1988, p. 751) reported that the local-likelihood method suffers from a 
substantial proportion of "incalculable estimates". Fan and Chen (1999) proposed to add a ridge 
parameter to attenuate the problem. The numerical instability problem still exists as the ridge 
parameter can be very close to zero. A numerically viable solution is the lower bound method, which 
we now introduce. 

4.1 Lower Bound Method for Local MLE 

The lower-bound method is very simple. For optimizing a concave function C, it replaces the Hessian 
matrix V^£(/3) in the Newton-Raphson algorithm by a negative definite matrix B, such that 



Lemma n shown in Bohning (1999, p. 14), indicates that the Newton-Raphson estimate, with the 
Hessian matrix replaced by the surrogate B, can always enhance the target function C 

Lemma 1 Starting from any Pq, the LB iterative estimate, defined by /3lb = Pq — B~^v£(/3o), 
leads to a monotonic increase of C{-), that is, £(/3lb) — C{(3q) > — 2^-*^ v£(/3o)"^B~^ v£(/3o) > 0. 

For the local logistic regression, v'^i{P;x) = — X(x)-'"W(x; /3)X(x). Since < W(x;/3) < 
4~-'^K(x), where K.{x) = diag{Kh{Xj — x)}, the Hessian matrix V^£(/3; x) indeed has a lower bound. 



Err^'^^ih) = [Q{Yi,m) + 2'^q"{mi){Yi - mif{l - 1/(1 - He)^} 



(3.15) 



1=1 



V^£{f3) > B, for ah /3. 



B(x) 



4-^X(x)^ K{x)X{x) 



(4.1) 



and the LB-adjusted Newton-Raphson iteration for computing (3{x) becomes 



Pl = Pl-1 - {B{x)}~'XixfKix)r{x; P^-i), L = l,2,... 



(4.2) 
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where r(x; f3) = (li — mi{x; (3), . . . ,Yn — mn{x; f3))'^ with mj{x; (3) = 1/[1 + exp{— Xj(3;)-^/3}]. 

The LB method offers a number of advantages to compute (3{x). Fh'stly, the corresponding LB 
matrix B(x) is free of the parameter vector f3, and thus can be computed in advance of the NR 
iteration. This in turn reduces the computational cost. Secondly, the LB matrix is stable, as it is 
the same matrix used in the least-squares local-polynomial regression estimates and does not depend 
on estimated local parameters. Thirdly, since the local-likelihood function £(/3; x) is concave, the LB 
iteration is guaranteed to increase i{(3;x) at each step and converge to its global maximum (3{x). 

4.2 A Hybrid Bandwidth Selection Method 

For binary responses, our simulation studies show that the bandwidth choice minimizing H3.9|) or 
its empirical version H3.15() tends to produce over-smoothed estimates. Such a problem was also 
encountered in Aragaki and Altman (1997) and Fan, Farmen and Gijbels (1998, Table 1). Because 
of the importance of binary responses in nonparametric regression and classification, a new bandwidth 
selector that specifically accommodates the binary responses is needed. 

We first employ the LB scheme ()4.2|) to derive a new one-step estimate of /3 (x), starting from 
3(x). Define Sn{x) = X(a;)^K(a;)X(x) and Si = ef{Sn{Xi)}-^eiKh{0), where ei = (1, 0, . . . , 0)^. 
The resulting leave-one-out formulas and the cross-validated estimates of the total prediction error 
are displayed in Proposition |S1 

Proposition 5 Assume conditions (Al) and {A2) in the Appendix. Then for the local-likelihood 
MLE in the Bernoulli family, 

^ R(^\ - ^{Sn{x)}-^Mx)Kh[X, - x){Y, - b'{^,{xfhx))} 

" • ^'^^ {Y^-m), (4.4) 



1-5, 



fhr-m = -^^^^(Y^-m), (4.5) 

i — Oj 

n 

= Yl [QiY^^^i) + 2-'q"{m){Yi - mif[l - {1 + Ab"{e,)SJ{l - 5^)}^]] .(4.6) 



i=l 



Direct use of a bandwidth selector that minimizes 1)4. 6() tends to under-smooth the binary re- 
sponses. To better appreciate this, note that the second term in 1)4. 6(1 is approximately 

- q"{m){Yi - mif{Ab"i9i)}Si, (4.7) 

and the second in (|3.9|) can be approximated as 

-q"im){Yi-mfH,. (4.8) 
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As demonstrated in Lemma|31in the Appendix, Si decreases with h and Hi = Si. Since < Ab"{9i) < 
1 for the Bernoulh family, H4.7|) down weighs the effects of model complexity, resulting in a smaller 
bandwidth. 

^cv 

The above discussion leads us to define a hybrid version of Err as 

n 

[Q{Yi, fhi) + 2-\"{mi){Yi - fhif [1 - {1 + 2b"{ei)S,/{l - Si) + 2'^ Hi/ {I - i/i)}']] , (4-9) 

i=l 

which averages terms in ()4.7() and H4.8|) to mitigate the oversmoothing problem of criterion ()3.9|) . This 
new criterion has some desirable properties: 2b"{6i)Si/{l — Si) + 2^^Hi/{\ — Hi) is bounded below 
by 2^^Hi/{l — Hi), thus guarding against under-smoothing, and is bounded above by {Si/{1 — Si) + 
Hi/(1 — Hi)}/2, thus diminishing the influence of over-smoothing. An empirical cross-validation 
criterion is to replace Si and Hi in 1)4. 9|) by their empirical averages, which are ()3.14|) divided by n. 
A hybrid bandwidth selector for binary responses can be obtained by minimizing this ECV. 



5 Extension to Generalized Varying- Coefficient Model 

This section extends the techniques of Sections 121 and 0] to a useful class of multi-predictor models. 
The major results are presented in Propositions IHHHl 

Consider multivariate predictor variables, containing a scalar U and a vector X = {Xi, . . . , X^)'^ . 
For the response variable Y having a distribution in the exponential-family, define by m{u, x) = 
E(Y\U = n, X = x) the conditional mean regression function, where x = {xi, . . . ,Xd)'^ ■ The 
generalized varying-coefficient model assumes that the d + 1-variate canonical parameter function 
9{u,x) = g{m{u,x)), with the canonical link g, takes the form 

d 

g{m{u,x)) = 0{u,x) = ^afc(u)xfc = x^A{u). (5.1) 

k=l 

for a vector A{u) = (ai(u), . . . , ad{u))'^ of unknown smooth coefficient functions. 

We first describe the local-likelihood estimation of A{u), based on the independent observations 
{(f/j, Xj, Yj)"^-^}. Assume that afc(-)'s are {p + l)-times continuously differentiable at a fitting point 
u. Put AW(n) = {af{u), . . .,af{u))'^. Denote by p{u) = A(i)(n)^, . . . , A^p\u)^ / p\)^ the 

d{pA-l) by 1 vector of coefficient functions along with their derivatives, Uj(u) = (1, (Uj — u), . . . , {Uj — 
u)P)'^, and Id a d X d identity matrix. For observed covariates Uj close to the point u, 

A{Uj) = A{u) + {Uj - n)A(i)(n) + ••• + ([/,•- u)PA^P\u)/pl = {u,{u) Id}^l3{u), 

in which the symbol denotes the Kronecker product, and thus from ()5.1|) . 

e{Uj,lj) = {uj{u) (g) Xj}^(3{u). 
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The local-likelihood MLE (3{u) maximizes the local log-likelihood function: 

n 

e{/3;u) = Y,K{^jiu)C^^jfP;Y,)Kh{U,-u). (5.2) 

The first d entries of P{u) supply the local MLEs A{u) of A{u), and the local MLEs of 9{u,x) and 
m{u,x) are given by 6{u,x) = x^A{u) and rh{u,x) = b' {6{u,x)), respectively. A similar estima- 
tion procedure, applied to n — 1 observations excluding (C/j,Xj,li), leads to the local log-likelihood 
function, £~^{P;u), and the corresponding local MLEs, (3 {u), 9^^{u,x), and m~'^{u,x) respectively. 

5.1 Leave-One-Out Formulas 

To derive the leave-one-out formulas in the case of multivariate covariates, we need some additional 
notations. Let X*(n) = (ui(ii) Xi, . . . ,Un{u) (g) X„)^, W*(n;/3) = diag{Kh{Uj - u)b"{{uj{u) (g 
Xj}'^/3)}, and S*{u;P) = X*(n)^W*(u; /3)X*(n). Define a projection matrix as 

n*{u;(3) = {W*{u-,p)}'/'X*{u){SUn-,m-'^*iuf{^*{u-,P)V^'. 

Let 7i*^{u; (3) be its ith diagonal entry and = H*^{Ui; (3{Ui)). Propositions IHl and [71 below present 
the leave-one-out formulas and cross- validated estimate of the total prediction error. 

Proposition 6 Assume condition {A2) in the Appendix. Then for i = 1, . . . ,n, 

{S*{u; p{u))}-^{ui{u) Xi}Kh{U, - u){Yi - h'{{xxi{u) ® X,}^3(n))} 



(3 {u)-l3{u) 



-{Y,-m)/b"{ei), 



1 - m 



mi = (Yi-fhi). 



Proposition 7 Assume conditions (Al) and {A2) in the Appendix. Then 

n 

Err = [Q(y„mi)+ 2-i(7"(m,)(y,-m,)'{ 1-1/(1 -^D'lJ- (5-3) 

i=\ 

5.2 Empirical Cross- Validation 

In the generalized varying-coefficient model, the asymptotic expression of the degrees of freedom 
^Y^=\ is given below. 

Proposition 8 Assume conditions (A) and (C) in the Appendix. If n ^ oo, h ^ 0, and nh oo, 

we have 

n 

Y,H* = dlCiO)\nu\/h{l + op{l)}. 
1=1 
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As /i — > oo, the total number of model parameters becomes d{p + 1) and this motivates us to 
propose the empirical formula for degrees of freedom: 

n 

Y,H* = d{{p + l-a.)+Cn/{n-d)IC{0)\nu\/h}. (5.4) 

i=l 

The empirical version of the estimated total prediction error is to replace H* in (|5.3|) hy d{{p + 1 — 

— ECV 

a)/n + C/(n — d)IC{0)\Qif\/h}. Call Err (h) the empirical version of the cross-validation crite- 

_ — ECV 

rion. Compared with the bandwidth selector in Cai, Fan and Li (2000), the Err (/i)-minimizing 
bandwidth selector, /iecVi is much easier to obtain. 

5.3 Binary Responses 

For Bernoulli responses, the LB method in Section0]continues to be applicable for obtaining P{u) and 
P (n). For the local logistic regression, y'^i{j3;u) has a lower bound, B(n) = — 4^^X*(u)^K*(n)X*(u), 
where K*(u) = dmg{Kh{Uj — u)}. Similar to ()4.2() . the LB-adjusted NR iteration for [3{u) proceeds 
as follows, 

f^L = - {B{u)}-'X*{ufK*{u)r*{u-PL-i), L=l,2,..., 

where r*(n;/3) = (yi-m*(n; /3), . . . , y„-m;(ti; /3))^ with m*(u; /3) = 1/(1 + exp[-{uj(n) ® Xj}^/3]). 
The leave-one-out formulas and the cross-validated estimates of the prediction error are similar 
to those in Proposition [51 with Sn{x) replaced by S*{u) = X*(n)-^K*(n)X*(ti) and Si by S* = 
{ei(^Xi)'^ {S*{Ui)}'~^ {ei(^li)Kh{0) . In the spirit of 1)4. 9|) . the hybrid selection criterion for bandwidth 
is 

n 

J2 [QiY^,m) + 2-\"{mm - [1 - {1 + 2b"{9,)S:/{l - S*) + 2-^H*/{l - H*)]^]] . (5.5) 

i=l 

The ECV criterion can be obtained similarly via replacing S* and H* by their empirical averages, 
which are 1)5. 4(1 divided by n. 



6 Extension to Semiparametric Model 

A further extension of model 1)5. 1|) is to allow part of the covariates independent of C/, resulting in 

6l(n,x,z) = x'^A(u) -Fz"^/?, (6.1) 

in which (n, x,z) lists values of all covariates ([/, X,Z). This model keeps the flexibility that Z does 
not interact with U . The challenge is how to choose a bandwidth to efficiently estimate both the 
parametric and nonparametric components. In this section, we propose a two-stage bandwidth 
selection method which is applicable to general semiparametric models. This is an attempt to 



16 



answer an important question raised by Bickel and Kwon (2001) on the bandwidth selection for 
semiparametric models. 

The parameters in model ()6.1() can be estimated via the profile likelihood method. For each 
given /3, applying the local-likelihood method with a bandwidth h, we obtain an estimate A{u; f3, h), 
depending on h and /3. Substituting it into (|6.1() . we obtain a pseudo parametric model: 

e{u,x,z) = x^A{u;l3,h) +z^l3. (6.2) 

Regarding ()6.2|) as a parametric model with parameter /3, by using the maximum likelihood estima- 
tion method, we obtain the profile likelihood estimators /3(/i) and A{u; (3{h), h). This estimator is 
semiparametrically efficient. 

We now outline a two-stage method for choosing the bandwidth. The idea is also applicable to 
other semiparametric problems. Starting from a very small bandwidth ho, we obtain a semipara- 
metric estimator /3(/io) (see a justification below). To avoid difficulty of implementation, the nearest 
type of bandwidth can be used. This estimator is usually root-n consistent. Thus, /3 in model ()6.1() 
can be regarded as known and hence model (|6.1|) becomes a varying coefficient model. Applying a 
bandwidth selection method for varying coefficient models, such as the approximate cross-validation 
method in the previous section, we obtain a bandwidth h. Using this h, we obtain the profile likeli- 
hood estimator /3(/i) and A{u; (3{h), h). This is a two-stage method for choosing the bandwidth for 
a semiparametric model. 

To illustrate the idea, we specifically consider the partially linear model: 

Yi = a{Ui) + zJfi + ei, i = l,...,n. (6.3) 

Assume that the data have already been sorted according to Ui. Let /iq be the nearest two-point 
bandwidth so that 

a{Ui;P, ho) = 2-\Yi - zf/3 + Yi^i - zf_i/3). 
Substituting this into 1)6. 3|) and rearranging the equation, we have 

Yi-Yi-i = {Zi-Zi^if(3 + 2ei. (6.4) 

Applying the least-squares method, we obtain an estimator /3(/io). 

To see why such a crude parametric estimator j3{ho) is root-n consistent, let us take the difference 
of (|6.c{jl . Under the mild conditions, a(Ui) — a{Ui-i) = Op{n^^). Hence, the difference of (|6.3j) yields 

Yi - Yi^i = {Zi - Zi_i)^/3 + ei- Ei.i + Op(n-i). 

Hence, the least-squares estimator (3, which is the same as (3{ho), is root-n consistent. See Yachew 
(1997) for a proof. This example shows that even if a very crude bandwidth ho is chosen, the 
parametric component /3(/io) is still root-n consistent. 
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The two-stage bandwidth selector is to apply a data-driven bandwidth selector to the following 
univariate nonparametric regression problem: 

Yi = a{Ui) + z[^{ho)+£i, 

and to use the resulting bandwidth for the original semiparametric problem. Such an idea was 
implemented in Fan and Huang (2005). They reported that the resulting semiparametric and non- 
parametric estimators are efficient. 

7 Simulations 

The purpose of the simulations is three-fold: to assess the accuracy of the empirical formulas p. 14(1 
and ()5.4() . the performance of the bandwidth selector /iecVj and the behavior of the proposed band- 
width selector for local-likelihood estimation. For Bernoulli responses, we apply the hybrid bandwidth 
selector to local logistic regression. Throughout our simulations, we use the g2-function associated 
with the deviance loss for bandwidth selection, combined with the local-linear likelihood method and 
the Epanechnikov kernel. Unless specifically mentioned otherwise, the sample size is n = 400. A 
complete copy of Matlab codes is available upon request. 

7.1 Generalized Nonparametric Regression Model 

For simplicity, we assume that the predictor variable X has the uniform probability density on the 
interval (0, 1). The bandwidth /iecv is searched over an interval, [/imirn -5], at a geometric grid of 30 
points. We take /imin = 3 /iq for Poisson regression, whereas for logistic regression, we take hmm = 5 /iq 
in Example 1 and /imin = -1 in Examples 2-3, where ho = max[5/n, max2<j<n{-^(j) — with 
-^(1) < • • • < X(^n) being the order statistics. 

[ Put Figure\^ about here ] 

Poisson regression: We first consider the response variable Y which, conditional on X = x, 
follows a Poisson distribution with parameter X{x): 

P{Y = y\X = x) = {X{x)}y exp{-A(x)}/y!, y = 0, 1, . . . . 

The function 9{x) = ln{A(2;)} is given in the test examples. 

Example 1: 0{x) = 3.5[exp{-(4x - 1)^} + exp{-(4x - 3)^}] - 1.5, 
Example 2: e{x) = sin{2(4x - 2)} + 1.0, 
Example 3: 6l(x) = 2 - .5(4x - 2)^. 
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As an illustration, we first generate from (X, Y) one sample of independent observations {{Xj, Yj)^^^}. 
Figure Ol^a) plots the degrees of freedom as a function of h. It is clearly seen that the actual val- 
ues (denoted by dots) are well approximated by the empirical values (denoted by circles) given by 
H3.14() . To see the performance of /iecV) Figure E^b) gives boxplots of the relative error, {/iecv — 
/iAMPEc(g2)}//iAMPEc(92) and {/lECV " /iAMiSEj/^iAMiSE, based on 100 random samples; refer to 
Tabled for values of /iAMPEc('?2) and /iamise- We observe that /iecv is closer to /iAMPEc(^2) than to 
^amise; this is in accordance with the discussion of Section 1531 In Figure |31^c) , we simulate another 
100 random samples and for each set obtain /iecv to estimate 9{x). We present the estimated curves 
from three typical samples. The typical samples are selected in such a way that their ASE values, 
in which ASE = n"^ Y^]=i{0{Xj) - 0{Xj)}^, are equal to the 25th (dotted line), 50th (dashed line), 
and 75th (dash-dotted line) percentiles in the 100 replications. Inspection of these fitted curves sug- 
gests that the bandwidth selector based on minimizing the cross- validated deviance does not exhibit 
undersmoothing in the local-likelihood regression estimation. In Figure |31 similar results are also 
displayed in the middle panel [Figures intd)-(f )] for Example 2, and in the bottom panel [Figures 
inig)-(i)] for Example 3. 

[ Put Figure^about here ] 

Logistic regression: We now consider the Bernoulli response variable Y with canonical param- 
eter, 6{x) = logit{P(y = 1\X = x)}, chosen according to 

Example 1: 9{x) = 7[exp{-(4x - 1)^} + exp{-(4x - 3)^}] - 5.5, 
Example 2: 9{x) = 2.5 sin(27rx). 
Example 3: e{x) = 2- (4x -2f. 

In Figure |1J we conduct the simulation experiments serving a similar purpose to Figure 01 Plots in 
the middle (vertical) panel lend support to the convergence of the hybrid bandwidth selector /iecv 
to /iAMPEc(^2)5 without suffering from the under- or over-smoothing problem. 

7.2 Generalized Varying-CoefRcient Model 

We consider examples of the generalized varying-coefficient model (|5.1I) . We take /imin = 3 /iq for 
Poisson regression, where /iq = max[5/n, max2<j<n{f^(j) — and hmm = -1 for logistic regres- 

sion. 

[ Put Figure [31 about Jiere ] 

Poisson regression: We consider a variable Y, given values (u, x) of the covariates (U,X), 
following a Poisson distribution with parameter A(u, x), where the varying-coefficient functions in 



19 



1ii{A(m, x)} are specified below: 

Example 1: d = 2, ai{u) = 5.5 + .1 exp(2ti — 1), a2{u) = .8n(l — u), 

Example 2: d = 3, ai{u) = 5.5 + .1 exp(2ti — 1), 02(14) = .8n(l — u), a3{u) = .2 sin^(27ru). 

We assume that [/ is a uniform random variable on the interval [0, 1] and is independent of X = 
{Xi,X2,X3)^, with Xi = 1, where {X2,X3) follows a zero-mean and unit-variance bivariate normal 
distribution with correlation coefficient 1 / -v/2. In Figure El plot (a) reveals that the actual degrees 
of freedom are well captured by the empirical formula H5.4() . To evaluate the performance of /iecV; 
we generate 100 random samples of size 400. Figure intb)-(c) plot the estimated curves of ai{u) 
and a2{u) from three typical samples. The typical samples are selected so that their ASE values, in 
which ASE = n'^ Yl^iMUj^^j) - e{Uj,Xj)y, correspond to the 25th (dotted line), 50th (dashed 
line), and 75th (dash-dotted line) percentiles in the 100 replications. The corresponding results for 
Example 2 are given in Figure[5fa')--(d'). These plots provide convincing evidences that /iecVi when 
applied to recovering multiple smooth curves simultaneously, performs competitively well with that 
to fitting a single smooth curve. 

[ Put Figure\^ about here ] 

Logistic regression: We now consider the varying-coefficient logistic regression model for 
Bernoulli responses, in which varying-coefficient functions in logit{P(y = 1\U = u, X = x)} are 
specified as 

Example 1: d = 2, ai{u) = 1.3{exp(2n - 1) - 1.5}, a2{u) = 1.2{8ti(l - u) - 1}, 

Example 2: d = 3, ai{u) = exp(2n — 1) — 1.5, a2{u) = .8{8ti(l — u) — 1}, a3{u) = .9{2sin(7rn) — 

We assume that Xi = 1; X2 and X3 are uncorrelated standard normal variables, and are inde- 
pendent of U ^ f7(0, 1). Figure ini depicts plots whose captions are similar to those for Figure El 
Compared with previous examples of univariate logistic regression and varying-coefficient Poisson 
regression, the current model fitting for binary responses is considerably more challenging. Despite 
the increased difficulty, the LB local-likelihood logistic regression estimates, through the use of the 
hybrid bandwidth selector /iecVj captures the major features of the model structure with reasonably 
good details. 

8 Real Data Applications 

In this section, we apply the hybrid bandwidth selection method for binary responses to analyze an 
employee dataset (Example 11.3 of Albright, et al. 1999) of the Fifth National Bank of Springfield, 
based on year 1995 data. The bank, whose name has been changed, was charged in court with that 
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its female employees received substantially smaller salaries than its male employees. For each of its 
208 employees, the dataset consists of eight variables, including 

• JobGrade: a categorical variable for the current job level, with possible values 1-6 (6 is highest) 

• YrHired: year employee was hired 

• YrBorn: year employee was born 

• Gender: a categorical variable with values "Female" and "Male" 

• YrsPrior: number of years of work experience at another bank prior to working at Fifth National. 

The data set was carefully analyzed by Fan and Peng (2004). After adjusting for covariates such as 
age, years of work experience, and education level, they did not find stark evidence of discrimination. 
However, they pointed out that 77.29% (i?^) of the salary variation can be explained by the job grade 
alone and the question becomes whether it was harder for females to be promoted, after adjusting for 
confounding variables such as age and years of work experience. They did not carry out the analysis 
further. 

To understand how the probability of promotion to high levels of managerial job (and thus high 
salary) is associated with gender and years of work experience, and how this association changes 
with respect to age, we fit a varying-coefficient logistic regression model, 

logit{P(y = l\U = u,Xi = xi,X2 = X2)} = aoiu) + ai{u)xi + a2(n)x2, (8.1) 

with Y the indicator of JobGrade at least 4, U the covariate Age, Xi the indicator of being Female, 
and X2 the covariate WorkExp (calculated as 95 — YrHired + YrsPrior). Following Fan and Peng (2004), 
outliers have been deleted, with the remaining 199 data for analysis. For this medium-lengthed data, 
use of the bandwidth selector h^cY which minimizes 1)5. 5() seems to be more natural than /iecv- 

[ Put Figure\^ about here ] 

Our preliminary study shows a monotone decreasing pattern in the fitted curve of 02 (u). This is 
no surprise; the covariates Age and WorkExp are highly correlated, as can be viewed from the scatter 
plot in Figure [7|^a) . Such high correlation may cause some identifiability problem, thus in model 
(|8.1|) . we replace X2 with a de-correlated variable, X2 — E{X2\U), which is known to be uncorrelated 
with any measurable function of U. The projection part, E{X2\U = u), can easily be estimated by 
a univariate local linear regression fit. Likewise, its bandwidth parameter can simply be chosen to 
minimize the approximate cross-validation function (for Gaussian family), illustrated in Figure [7Kb). 

After the de-correlation step, we now refit model (|8.1() . The bottom panel of Figure [7| depicts 
the estimated varying coefficient functions of ao(ti), ai{u) and a2{u), plus/minus the pointwise 1.96 
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times of their estimated standard error. The selected bandwidth is 16.9 [see Figure |7fc)]. Both the 
intercept term and (de-correlated) WorkExp have the statistically significant effects on the probability 
of promotion. As an employee gets older, the probability of getting promoted keeps increasing until 
around 40 years of age and levels off after that. It is interesting to note that the fitted coefficient 
function of ai(n) for gender is below zero within the entire age span. This may be interpreted as 
the evidence of discrimination against female employees being promoted and lends support to the 
plaintiff. 

[ Put Figure\^ about here ] 

To see whether the choice of smoothing variable U makes a difference in drawing the above 
conclusion, we fit again model (|8.1|) with U given by the covariate WorkExp and X2 by the de- 
correlated Age (due to the same reason of monotonicity as in the previous analysis). Again, Figure |S1 
shows that gender has an adverse effect and the evidence for discrimination continues to be strong. 
Indeed, the estimated varying- function of ai{u) is qualitatively the same as that in Figure [71 as far 
as the evidence of discrimination is concerned. 

We would like to make a final remark on the de-correlation procedure: This step does not alter 
1)8. H) . particularly the function ai(-). If this step is not taken, then the estimate of ai{u) from either 
choice of U continues to be below zero and thus does not alter our previous interpretation of the 
gender effect. 
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Appendix: Proofs of Main Results 

We first impose some technical conditions. They are not the weakest possible. 
Condition (A): 

(Al) The function q is concave and q"{-) is continuous. 
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(A2) b"{-) is continuous and bounded away from zero. 

(A3) The kernel function if is a symmetric probability density function with bounded support, and 
is Lipschitz continuous. 

Condition (B): 

(Bl) The design variable X has a bounded support Ox and the density function fx which is Lipschitz 
continuous and bounded away from 0. 

(B2) 6{x) has the continuous {p + l)-th derivative in Ox- 
Condition (C): 

(CI) The covariate U has a bounded support Qjj and its density function fu is Lipschitz continuous 
and bounded away from 0. 

(C2) aj{u), j = 1, . . . ,d, has the continuous (p + l)-th derivative in 0[/. 

(C3) For the use of canonical links, the matrix T{u) = E{b"{9{u,X))XX^\U = u} is positive definite 
for each u S Qjj and is Lipschitz continuous. 

Notations: Throughout our derivations, we simplify notations by writing 9j{x;(3) = :x.j{x)'^f3, 
mj{x; (3) = h'{d,{x- (3)), Zj{x; (3) = {F^ -m,-(x; (5)} /b" {6j{x- p)), z{x; (3) = (Zi(x; /3), . . . , Z„(x; /3))^, 
and Wj{x;(3) = Kh{Xj — x)b" {9j{x; (3)); their corresponding quantities evaluated at /3(x) are denote 
by 6j{x), mj{x), Zj{x), z(x), and Wj{x). Similarly, define Sn{x) = Sn{x; f3{x)). 
Before proving the main results of the paper, we need the following lemma. 

Lemma 2 Define Vj((5) = diag{5ji, • • • ,<5m}- For £i^s{/3',x) defined in ((23), 

Vi,4(3;x) = X{xfYi{6)W{x;p)z{x;(3)/a{i;), (A.l) 
V^ei,s(.(3;x) = -X(x)^V,(5)W(x;/3)X(x)/a(V), (A.2) 

in which 

X(x)^V,(5)W(x; (3)z{x; (3) = X{xfW{x; /3)z(x; /3) - (1 - 5)x,{x)Kh{X, - x){Yi - mi{x; /3)},(A.3) 
X{x fYi{6)W{x;(3)X{x) = X{xfW{x; (3)X{x) - {I - 6)wi{x; p)^i{x):>Ci{x f . (A.4) 

Proof: Defining a vector 9{x; f3) = {9i{x\ (3), . . . , 9n{x; P))'^ , we have that 

V4,5(/:',xj - Qe^^.p^ ' (A.5) 
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Since 

n 

£^4f3;x) = Y,6ij[{YjXj{xfp - b{^j{xf P)}/a{i;) + c{Y,,i;)]Kh{Xj - x), 
it is easy to check that 

= 5,,{Y, - b'{e,{x;P))}K,{X, - x)/a{^P). (A.7) 

This combined with ()A.5|) leads to (|A.1|) . 
Following (|A.7|) . we see that 

This along with (|A.6() indicates HA.2() . 

()A.3|1 and HA.4|1 can be obtained by decomposing the identity matrix I into Vj((5) and I — Vj((5). 

Proof of Proposition [T] 

From (UHl and (|X2|l . (jHSl) can be rewritten as 

Pl = f^L-i + {X(x)^Vi(5)W(x; /3i_i)X(x)}-HX(x)^V,(5)W(x; /3i_i)z(x; (3^^,)}. (A.9) 

Setting 5 = in HA.9|) . the one-step estimate of (3 (x), which starts from /3q = (3{x), is given by 

p{x) + {X(x)^Vi(0)W(x; 3(aj))X(x)}-i{X(x)^Vi(0)W(x; 3(x))2(x)}. (A.IO) 

Using the definition of /3(x) (satisfying \/ii^s{f3',x) = with 6 = 1), along with HA.3|) and HA.4|1 . the 
above one-step estimate of /3 (x) equals 

Pix) - {Sn{x) - Wi{x)Mx)xi{xfr^Mx)Kh{X, - x){Yi - mix)}. (A.ll) 

According to the Sherman-Morrison- Woodbury formula (Golub and Van Loan 1996, p. 50), 

{Sn[x) - Wi{x)l^i(x)^i{x) } ={Sn(x)} + ^ ^ — • 

1 - -Wi^XjXj^x)^ |6n(x)j ^Xj(x) 

Thus 

{Sn{x) - ■«}j(x)Xj(x)Xj(x)^}"^Xj(x) = {S„(x)}"^Xj(x) 

^ Wi(x){5'n(x)}~^Xi(x)xi(x)'^{5'n(3;)}~^Xi(x) _ {5n(x)}~^Xj(x) 

1 - u;i(x)xi(x)'^{S'„(x)}-ixj(x) 1 - Unix; P{x)) 

by which HA.11|) becomes 

{Sn{x)y^Xi{x)KhiXi - x){Yi - rhiix)} 



(3{x) 



l-nii{x;f3{x)) 
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This expression approximates (3 {x) and thus leads to (|3.6j) . 

Note that Oi = OiiXi), m, = m,{Xi), Qr^ = 9~'{Xi), m-' = m-\Xi), and 

Hi = eJ{Sn{Xi)}-^eiKhiO)b"{ei). (A.12) 

Applying H3.6|) . we have 

er^-e, = eJ0~\x,)-p{x,)} 

leading to (|3.7|) . This, together with a first-order Taylor's expansion and the continuity of h" , yields 

m~' - = h'{er^) - b'{9i) = [Or' - 9,)b"{ei), 

and thus ()3.8|) . 

Proof of Proposition [21 

By a first-order Taylor expansion, we have that 

% - = 2-H'7'(^r) - q'{m)] = 2-\'\mi){mr' - mi), 
Q{m-\mi) = -2~^q"{mi){m~' -mif. 

These, applied to an identity given in the Lemma of Efron (2004, Section 4), 

Q{Yi,m-') - Q{Yi,mi) = 2{% -\t^){Y, - mT^) - Q{mr\mi), 

lead to 

QiYi,,m-') - QiYi,mi) = q"{mi){m-' - m,)(l^i - m"^) + 2-\"{mi){m-' - fh,f 

= 2-\"{mi){{Yi - m.i)^ - {Yi - fhr^f]. 

Summing over i and using (|3.8() and ()2.7|) . we complete the proof. 

Proof of Proposition [31 

From and ([TT^ . we see that 

^AMPEc(92) 



/^AMISE 



lOxl Co F{x)Gix)dx ll/(2p+3) 

\ \ I y J . (A.13) 



!n^F{x)dxLG{x)dx 



For part (a), it suffices to consider oppositely ordered F and G. In this case, by the Tchebychef's 
inequality (Hardy, Littlewood, and Polya, 1988, p. 43 and 168), we obtain 

/ F{x)G{x)dx < [ F{x)dx [ G{x)dx. 
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Since F > and G > 0, it follows that 



\nx\!^^F{x)G{x)d. 



.X 



< 1, 



which along with HA.13() indicates that /iampec(92) < ^amise- 

To verify part (b), it can be seen that under its assumptions, for a constant C > 0, F[x) = 
C /\Q.x\b" {6{x)) is oppositely ordered with G{x) = {b" {6{x))}^'^ , and thus the conclusion of part (a) 
immediately indicates the upper bound 1. To show the lower bound, we first observe that (|A.13() 
becomes 



^AMPEc(g2) 
^AMISE 



Lb"{e{x))dx L{h"{e{x))]-Hx 



l/(2p+3) 



(A.14) 



Incorporating the Griiss integral inequality (Mitrinovic, Pecaric, and Fink 1993), 



x\ Jn 



F{x)G{x)dx 



I ,„ , F{x)dx / G{x)dx 



< -{MF-mF){MG-mG), 



where Mp = max^g^^ F{x), nip = min^^n^ F{x), Mq = maxa-g^x G{x), and niG = miUa-gQ^ G{x) 
we deduce that 

/ b'\e[x))dxf {5-(g(x))}-^cix< ^7 + y"^V xp. 
J^x Jnx Ami,„Mi,„ 

This applied to (|A.14|) gives the lower bound. 



Proof of Proposition |1| 

Define H = diag{l, h, . . . , hf}. From HA.12|) . we have 

= eJ{Sn{X,)/b"{e{X,))}-^eiKh{0) 

= n-^e^-R-^{n-^U-^Sn{X,)/b"(e{X,))U-^}-^Il-^eiKh{0) 
= inh)-^eJ{n-^-ll-^Sn{Xi)/b"{e{Xi))U-^}-^eiK{0), 



(A.15) 



where = ^^^^Xj{x)xj{x)^ Kh{Xj — x)6"(xj(x)^/3(x)). By Taylor's expansion and the conti- 

nuity assumptions on b" and fx, it follows that for x S Qx, 

n-iH-iS„(x)/6"(0(x))H-i = + op(l). 

Combining this expression with ()A.15|) . it can be shown that 

~( ~[ nhfxiXi) nh ^ fx{Xi) 



which will finish the proof. 
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Lemma 3 Assume that the kernel function K is non-negative, symmetric and uni-modal. Then for 
i = 1, . . . ,n, Si is a decreasing function of h > for which Si is well-defined. 

Proof: Consider the matrices Ai{h) = X{Xifdiag{K{\Xj - Xi\/h)}]^-^X{Xi), i = l,...,n. UK 
is non-negative and uni-modal, then < /ii < /12 imphes that Ai(hi) < Ai(h2) or, equivalently, 
{Ai{hi)}^^ > {Ai{h2)}'^ . We complete the proof by noting Si = eJ{Ai{h)}^^eiK{0), since K is 
symmetric. 

Proof of Proposition [H 

The one-step estimate of /3 (x), starting from Pq = /3(x), is given by 

3(x) + 4{X(2;)^Vi(0)K(2;)X(x)}-Hx(rE)^Vi(0)W(x; P{x))z{x)}, (A.16) 

i.e., P{x) - 4:{Sn{x) - Kh{Xi - x)y.i{x)yii{xy]~^yii{x)Kh{Xi - x){Yi - mj(x)}. Again, using the 
Sherman-Morrison- Woodbury formula (Golub and Van Loan 1996, p. 50), 

T^~l _ i Q / , Kh{Xi - x){Sn{x)}~''yLi{x)yii{x)'^ {Sn{x)}~^ 



{Sn{x) - Kh{X, - X)X,(X)X,(X)^ = {Sn{x)}'' + 



1 - Kh{Xi - x)x,(x)^{5n(x)}-ixi(x) ' 
and thus 

{Sn{x) - Kh{Xi - x)Xj(x)Xi(x)'^}"^Xj(x) = {Sn{x)}~^Xi{x) 

KhiXi - x){Sn{x)}~'^Xi{x)xi{x)'^{Sn{x)}-'^Xi{x) {Sn{x)}-'^Xi{x) 



1 - KhiXi - x)xi(x)^{5„(x)}-ixi(x) 1 - Kh{X^ - x)x,(x)r{S„(x)}-ix,(x) 

by which HA.16|) becomes 

^Sn{x)}-'l^^{x)Kh{Xi - X){Y, - m{x)} 



f3{x) 



1 - Kh{X^ - x)x,(rE)^{5„(x)}-ixi(x) 



This expression approximates f3 (x) and thus leads to (|4.H|) . 
Applying ()4.3() . we have 

e-' -e, = eJ0~\xi)-p{Xi)} 

. 4el{SniXi)}-^eiKhmYi-mi) 45, 



{Yi - fhi), 



I -Si 1-5, 
leading to 1)4. 4|) . Proofs of ()4.5() and ()4.6() are similar to those of Proposition [21 

Proofs of Propositions 

The technical arguments are similar to the proofs of Propositions ^^21 and thus details are omitted. 
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Proof of Proposition [HI 

Recalling the definition of H* in Section [5.21 we have that 

H* = (ei®X,f{5:(?7i;3(t/i))}^'(ei®X,) {i^^0)6"(^(?7,,X0)} 

= (n/i)-i(ei ® Xif{n~\U Irf)-i5*(C/i)(H ® Id)-i}-i(ei ® X,)K(0)6"(^(i7„ X,)), (A.17) 

where 

n 

Sniu) = ^{Uj{u) (g) Xj}{uj{u) (g) Ijj^KhiUj - u)b"{{uj{u) ^ Xj}^3W) 

i=i 

n 

= [{uj{u)uj{uf} {XjlJ)]Kh{Uj - u)b"{{uj{u) X,}^3(n)). 

It can be shown that for u £ rifj, 

n-i(H®Irf)-i5:H(H®Irf)-i 

n 

= [{U-%{u)uj{ufU-^} (XjXj)]i^/,(;7j - u)b"{{uj{u) (g> Xj}^P{u)) 

n 

= [{H-^Uj(^x)uj(tx)^H-^} (X,xJ)]i^,,(;7,- - n)6"(0(n, X,)) + op{l) 

= /c/(^x)[5 E{b"{e{u, x))xx^|[/ = u}] + op(i) = r(u)} + op(i). 

This expression applied to ()A.17|) further implies that 

n n ^ 

= E .,,^ (elCjDX,^g-^^m)-^}(el0X,)i^(O)6^^(g([/,,X,)){l + op(l)} 

" 1 

= E ^/,/^(^^) (^^^''^i)-^W{^^r([/0-^x,b--(g([/,,xO)}{i + op(i)} 

= E {xfr(^.)-^X,fe--(g([/,, X,))}{1 + op(l)}. (A.18) 

For ()A.18|) . a direct calculation gives that 

E{jFT{urHb"{e{u,T))/ fu{u)] = t,[E{T{urHi^b"{e{u,t))/ fu{u)]] 

= i,{E[nU)-^E{b"{e{U,t))JX^\U]/fu{U)\] 

= tT[E{r{ur'riu)/MU)}] 
= ti{\nu\id) = d\nu\. 

This completes the proof. 
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Table 1: The Asymptotic Optimal Bandwidths /iAMPEc(92) Calculated From 1^3. 12\) and /iamise 
From 1^3. 13\) . with p = 1, the Epanechnikov Kernel, and Examples Given in Section \7.1\ 



Exjjonential familv 


Example 


n A A/TPFP ( Q'? ) 


'^rt.lVilO-Cj 


Poisson 


1 


.070 


.079 




2 


.089 


.099 




3 


.127 


.136 


Bernoulli 


1 


.106 


.108 




2 


.151 


.146 




3 


.184 


.188 



Table 2: Choices of a and C, in the Empirical Formulas 1^3. 14\) and i)5.4l) . for the pth Degree Local 
Polynomial Regression for Gaussian Responses 



Design type 


P 


a 


C 


Design type 


P 


a 


C 


fixed 





0.55 


1 


random 





0.30 


0.99 




1 


0.55 


1 




1 


0.70 


1.03 




2 


1.55 


1 




2 


1.30 


0.99 




3 


1.55 


1 




3 


1.70 


1.03 
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Figure 1: Illustration of Margin-Based Loss Functions. Line types are indicated in the legend box. 
Each function has been re-scaled to pass the point (0, 1). 
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Figure 2: Illustration of the Bregman Divergence Q{Y,m.) as Defined in \2.'2^ . The concave curve 
is the q-function; the two dashed lines give locations of Y and fh; the solid strict line is q{m) + 
g'(m)(y — in); the vertical line with arrows at each end is Q{Y,fh). 
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(a) Poisson, Example 1 



(b) bandwidth selectors 



(c) estimated e(x) 
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(d) Poisson, Example 2 
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(g) Poisson, Example 3 
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Figure 3: Evaluation of Local-Likelihood Nonparametric Regression for Poisson Responses. Left 
panel: plots of ^[^j^ Hi versus h. Dots denote the actual values, centers of circles stand for the em- 
pirical values given by 1^3.14]) . for local-linear smoother with a = .70 and C = 1.03. Middle panels [fig- 
ures (b), (e), and (h)]: boxpiote of {/iECV-^AMPEc(Q'2)}//iAMPEc(Q'2) and {hECY - hAMiSE} /hAMiSE- 
Panels (c), (f), and (i): estimated curves from three typical samples are presented corresponding to 
the 25*^ (the dotted curve), the 50*'' (the dashed curve), and the 75^^ (the dash-dotted curve) per- 
centiles among the ASE-ranked values. The solid curves denote the true functions. 
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(a) Bernoulli, Example 1 



(b) bandwidth selectors 



(c) estimated e(x) 
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(d) Bernoulli, Example 2 



(e) bandwidth selectors 



(f) estimated 9(x) 





0.05 0.1 0.15 0.2 



MPEC 



MISE 



-2 



-4 















/ A 








\ 













0.5 



(g) Bernoulli, Example 3 (h) bandwidth selectors (i) estimated e(x) 
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Figure 4: Evaluation of Local-Likelihood Nonpar ametric Regression for Bernoulli Responses. Cap- 
tions are similar to those for Figure Here /iecv minimizes the empirical version of 1^4. y\) : the 
empirical formula i|,-^.I4|) uses a = .70 and C = 1.09 for Hi and a = .70 and C = 1.03 for Si. 
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(a') Poisson, Example 2 (b') estimated a^(u) 
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Figure 5: Evaluation of Local-Likelihood Varying CoefEcient Regression for Poisson Responses. Pan- 
els (a) and (a'): plots ofY^^=iH* versus h. Dots denote the actual values, centers of circles stand 
for the empirical values given by f5.4j) . for local-linear smoother with a = .70 and C = 1.03. Panels 
(h)-(c) for Example 1 and panels (b')-(d') for Example 2: estimated curves from three typical sam- 
ples are presented corresponding to the 25^^ (the dotted curve), the 50*'* (the dashed curve), and the 
75^^ (the dash-dotted curve) percentiles among the ASE-ranked values. The solid curves denote the 
true functions. 
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(a) Bernoulli, Example 1 



(b) estimated a^ (u) 
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(a') Bernoulli, Example 2 (b') estimated a^ (u) 
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Figure 6: Evaluation of Local-Likelihood Varying CoefEcient Regression for Bernoulli Responses. 
Captions are similar to those for Figure\^ Here /iecv minimizes the empirical version of i)5.5|) : the 
empirical formula i)5.4|) uses a = .70 and C = 1.09 for H* and a = .70 and C = 1.03 for S* . 
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(b) ACV for fitting E(X2|U=u) 



(c) ACV for fitting varying functions 
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Figure 7: Applications to the Job Grade Data Set Modeled by i)8.I|) . (a) scatter plot of work 
experience versus age along with a local linear Et; (b) plot of the approximate CV function against 
bandwidth for the local linear fit in (a); (c) plot of the approximate CV function, defined in i|5.5|) . 
against bandwidth for fitting varying coefRcient functions; (d) estimated ao{u); (e) estimated ai{u); 
(f) estimated 02 (u). The dotted curves are the estimated functions plus/minus 1.96 times of the 
estimated standard errors. 
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(a) 



(b) ACV for fitting E(X2|U=u) 
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(c) ACV for fitting varying functions 
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Figure 8: Applications to the Job Grade Data Set Modeled by i|8.ij) . Captions are similar to those 
for Figure\^ except that U is WorkExp and X2 is the de-correlated Age. 
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